Systems and methods for a privacy preserving text representation learning framework

ABSTRACT

Various embodiments of a computer-implemented system which learns textual representations while filtering out potentially personally identifying data and retaining semantic meaning within the textual representations are disclosed herein.

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a non-provisional application that claims benefit to U.S. Provisional Patent Application Ser. No. 63/018,287 filed Apr. 30, 2020, which is herein incorporated by reference in its entirety.

GOVERNMENT SUPPORT

This invention was made with government support under W911NF-15-1-0328 awarded by the Army Research Office, under 1614576 awarded by the National Science Foundation and under N00014-17-1-2605 awarded by the Office of Naval Research. The government has certain rights in the invention.

FIELD

The present disclosure generally relates to natural language processing; and in particular, to a computer-implemented system and method for learning textual representations of user-generated textual information which preserves semantic meaning while removing potential personal information.

BACKGROUND

Textual information is one of the most significant portions of data that users generate by participating in different online activities such as leaving online reviews and posting tweets. On one hand, textual data includes abundant information about users' behavior, preferences and needs, which is critical for understanding them. For example, textual data has been historically used by service providers to track users' responses to products and provide the user with personalized services. On the other hand, publishing intact user-generated textual data makes users vulnerable to privacy issues. The reason is that the textual data itself includes sufficient information to cause the re-identification of users in the textual database and the leakage of their private attribute information.

These privacy concerns mandate data publishers to protect users' privacy by anonymizing the data before sharing it. However, traditional privacy preserving techniques such as k-anonymity and differential privacy are inefficient for user-generated textual data because this data is highly unstructured, noisy and, unlike traditional documental content, can include large amounts of short and informal posts. Moreover, these solutions may impose a significant utility loss for protecting textual data as they may not explicitly include utility into their design objectives. It is thus challenging to design effective anonymization techniques for user-generated textual data which preserve both privacy and utility.

It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing an architecture for a computer-implemented text representation learning system;

FIG. 2 is a block diagram showing an auto-encoder of the system of FIG. 1;

FIG. 3 is a block diagram showing a semantic discriminator of the system of FIG. 1;

FIG. 4 is a block diagram showing a private attribute discriminator of the system of FIG. 1;

FIG. 5 is a flowchart showing a process flow for optimizing the text representation learning system of FIG. 1;

FIG. 6 is a flowchart showing a process flow for iteratively training the system of FIG. 1 to learn an amount of noise to add to a text representation;

FIG. 7A is a graph showing private attribute prediction with respect to sentiment prediction (F1) for different contribution values of a private attribute discriminator of the text representation learning system of FIG. 1;

FIG. 7B is a graph showing sentiment prediction accuracy for different contribution values of a private attribute discriminator of the text representation learning system of FIG. 1;

FIG. 7C is a graphical representation showing private attribute prediction with respect to part-of-speech tagging for different contribution values of a private attribute discriminator of the text representation learning system of FIG. 1;

FIG. 7D is a graphical representation showing part-of-speech tagging accuracy for different contribution values of a private attribute discriminator of the text representation learning system of FIG. 1; and

FIG. 8 is a simplified diagram showing an example device for implementation of the framework of FIG. 1.

Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.

DETAILED DESCRIPTION

Various embodiments of a framework for learning text representations of a document while maximizing semantic meaning and minimizing private attributes within text representations are disclosed herein. In some embodiments, the framework includes an auto-encoder for learning a text representation of a document, a differential-privacy-based noise adder for adding noise to the text representation, and semantic and private attribute discriminators to optimize the differential-privacy-based noise adder to ensure that semantic meaning is retained by the text representation while obfuscating private attributes. Referring to the drawings, embodiments of the system are illustrated and generally indicated as 100 in FIGS. 1-8.

Referring to FIGS. 1-4, a double privacy preserving text representation learning framework 100, also referred to as DPText, is disclosed herein. The framework 100 learns a modified latent representation 124 for a document 10 that (1) is differentially private and thus protects users against leakage of their identity, (2) obscures users' private information (e.g., age, location, gender), and (3) retains high utility and semantic meaning for a given task. The present framework 100 includes four main components: 1) an auto-encoder 102 (FIG. 2), 2) a differential-privacy-based noise adder 104, 3) a semantic meaning discriminator 106 (FIG. 3), and 4) a private attribute discriminator 108 (FIG. 4). It is further theoretically shown herein that a resultant modified latent representation 124 is differentially private. The effectiveness of the present framework 100 is also shown on real-world datasets in two important natural language processing tasks, i.e., sentiment prediction and part-of-speech (POS) tagging. The theoretical and empirical results show the effectiveness of DPText in minimizing chances of learned textual representation re-identification, obscuring private attribute information and preserving semantic meaning of the text.

Referring to FIG. 1, a document 10 can include textual information for analysis by the framework 100. The framework 100 includes an auto-encoder 102 configured to extract an initial latent representation z 122 for the document 10. The auto-encoder 102 extracts the initial latent representation z 122 and further minimizes a reconstruction error between the initial latent representation z 122 and the textual information within the document 10 itself. Once the initial latent representation z 122 is obtained, a differential privacy adder 104 is deployed along with the associated semantic meaning discriminator 106 and private attribute discriminator 108 to add random noise, i.e., Laplacian noise, to the initial latent representation 122 with respect to a given privacy budget, denoted herein as ϵ.

If one were to publish a text representation without proper anonymization, an adversary can learn the original text or infer if a targeted user's latent textual representation is in the database or which record is associated with it. Besides guaranteeing differential privacy, the act of adding noise minimizes the chance of text re-identification and original text recovery. However, simply adding noise to the initial latent representation z 122 not only may destroy the semantic meaning of the text, but also does not necessarily prevent leakage of private attribute information from the text information on its own. Semantic meaning of the text data is task-dependent. For example, for sentiment analysis, sentiment is one of the semantic meanings of the given text and sentiment prediction is a classification task. Private-attribute information is another important aspect of user privacy and includes information that the user does not want to disclose such as age, gender, and location.

It is therefore necessary to add an optimal amount of noise s to the original latent representation z 122. This challenge is approached by learning an amount of the added noise s using the privacy budget ϵ. As shown, the semantic meaning discriminator D_S 106 and the private attribute discriminator D_P 108 are also utilized to infer the amount of noise s to be added to the original latent representation z 122 by differential privacy adder 104. The semantic meaning discriminator D_S 106 ensures that the noise added by differential privacy adder 104 does not destroy the semantic meaning with respect to a given task. The private attribute discriminator D_P 108 guides the amount of noise s added by differential privacy adder 104 by ensuring that a resultant modified latent representation 124 does not include users' private information.

To incorporate the two discriminators D_S 106 and D_P 108 into determining an optimal amount of noise, an objective function is modeled as a minmax game between the two introduced discriminators, D_S 106 and D_P 108. Assume that there are T private attributes in the document 10. Let θ_{D_P^t} and θ_{D_S} respectively denote parameters of the private-attribute discriminator D_P 108 and the semantic meaning discriminator D_S 106. Correct labels for the t-th sensitive attribute and the semantic classification task in the n-th document are represented by p_{n,t} and y_n, respectively. With N documents, an objective function is written as follows:

$\begin{matrix}{\min\limits_{\theta_{D_{S}},\epsilon}\;\max\limits_{\{\theta_{D_{P}^{t}}\}_{t = 1}^{T}}\;\frac{1}{N}\sum\limits_{n = 1}^{N}\left\lbrack {\mathcal{L}_{D_{S}}\left( {\hat{y}_{n},y_{n}} \right) - \alpha\frac{1}{T}\sum\limits_{t = 1}^{T}\mathcal{L}_{D_{P}^{t}}\left( {\hat{p}_{n,t},p_{n,t}} \right)} \right\rbrack + \lambda\Omega(\theta)\quad\text{s.t.}\quad\epsilon \leq c_{1}} & (1)\end{matrix}$

where c₁ is a predefined privacy budget constraint, ℒ_{D_P^t} and ℒ_{D_S} denote cross entropy loss functions, p̂_t is a predicted t-th private attribute, ŷ is a predicted semantic label, and Ω(θ) is a parameter regularizer. θ={θ_{D_S}, ϵ, {θ_{D_P^t}}_{t=1}^{T}} is the set of all parameters to be learned, including parameters of the semantic meaning discriminator model D_S 106 and the private-attribute discriminator model D_P 108 and the privacy budget ϵ. Note that the resultant modified latent representation z̃=z+s 124 satisfies ϵ̃-differential privacy, where ϵ̃≤c₁ is an optimal learned budget.

Problem Statement

Let 𝒳={x₁, . . . , x_N} denote a set of N documents and 𝒫={p₁, . . . , p_T} denote a set of T private and sensitive attributes. Each document x_i 10 includes a sequence of words, i.e., x_i={x_i¹, . . . , x_i^m}. z_i∈ℝ^{d×1} is denoted as the content representation 122 of the original document x_i 10. The framework 100 aims to preserve users' privacy by preventing a potential adversary from inferring whether a target text representation is in the dataset, determining which record is associated with it, or learning the target users' private attribute information.

PROBLEM 1. Given a set of documents 𝒳, a set of sensitive attributes 𝒫, and a given task 𝒯, learn a function f that can generate and release the modified latent representation z̃_i 124 for each document x_i so that 1) the adversary cannot re-identify a targeted text representation and infer whether or not this latent representation is in the database, 2) the adversary cannot infer the targeted user's private attributes 𝒫 from the modified latent representation z̃_i 124, and 3) the modified latent representation z̃_i 124 is good for the given task 𝒯, i.e., z̃_i=f(x_i, 𝒫, 𝒯).

Differential Privacy Overview

Differential privacy protects a user's privacy during a statistical query over a database by minimizing the chance of privacy leakage while maximizing the accuracy of queries. Differential privacy provides a strong privacy guarantee. The intuition behind differential privacy is that the risk of a user's privacy leakage should not increase as a result of participating in a database. Differential privacy guarantees that the existence of an instance in the database does not pose a threat to its privacy, as the statistical information of the data would not change significantly in comparison to the case in which the instance is absent. This makes it challenging for an adversary to re-identify an instance and infer whether the instance is in the database or not, or decide which record is associated with it. An algorithm with the privacy property is denoted by 𝒜_p, which is randomized so that the re-identification of the data on the adversary's side is very difficult. Differential privacy can be formally defined as follows:

DEFINITION 1. ϵ-Differential Privacy. An algorithm 𝒜_p is ϵ-differentially private if for any subset of outputs R and for all datasets 𝒟₁ and 𝒟₂ differing in at most one element:

$\begin{matrix}{\frac{{\mathbb{P}}\left( {{\mathcal{A}_{p}\left( \mathcal{D}_{1} \right)} \in R} \right)}{{\mathbb{P}}\left( {{\mathcal{A}_{p}\left( \mathcal{D}_{2} \right)} \in R} \right)} \leq e^{\epsilon}} & (2)\end{matrix}$

where 𝒜_p(𝒟₁) and 𝒜_p(𝒟₂) are the outputs of the algorithm for input datasets 𝒟₁ and 𝒟₂, respectively, and the probability ℙ is taken over the randomness of the noise in the algorithm.

Here ϵ is called the privacy budget, and it can also be shown that Eq. 2 is equivalent to

$\left| {\log\left( \frac{{\mathbb{P}}\left( {{\mathcal{A}_{p}\left( \mathcal{D}_{1} \right)} = r} \right)}{{\mathbb{P}}\left( {{\mathcal{A}_{p}\left( \mathcal{D}_{2} \right)} = r} \right)} \right)} \right| \leq \epsilon$

for any point r in the output range. Note that larger values of ϵ (e.g., 10) result in larger privacy loss while smaller values (e.g., ϵ≤0.1) indicate the opposite. For example, a small ϵ means that the output probabilities for 𝒟₁ and 𝒟₂ at r are very similar to each other, which demonstrates more privacy. An uncertainty should be introduced in the output of a function (i.e., algorithm) to be able to hide the participation of an individual in the database. This is quantified by sensitivity, which is the amount of change in the output of function 𝒜 made by a single data point in the worst case.
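As a minimal, non-limiting sketch, the privacy-loss bound above can be checked numerically for a scalar query protected by Laplacian noise. The function names, the query outputs `out1` and `out2`, and the chosen numbers are illustrative assumptions and are not part of the disclosed framework.

```python
import math

def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution with location mu and scale b at x."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def privacy_loss(r, out1, out2, sensitivity, epsilon):
    """|log P(A(D1) = r) / P(A(D2) = r)| for Laplacian noise of scale sensitivity/epsilon."""
    b = sensitivity / epsilon
    return abs(math.log(laplace_pdf(r, out1, b) / laplace_pdf(r, out2, b)))

# Hypothetical query outputs on two neighbouring datasets; they differ by
# exactly the assumed sensitivity of 1.0.
loss = privacy_loss(r=0.3, out1=1.0, out2=2.0, sensitivity=1.0, epsilon=0.5)
print(loss <= 0.5 + 1e-12)
```

Because the two outputs differ by no more than the sensitivity, the computed loss never exceeds the chosen ϵ, whichever output point r is inspected.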

DEFINITION 2. ℓ₁-sensitivity. The ℓ₁-sensitivity of a vector-valued function 𝒜 is the maximum change in the ℓ₁ norm of the value of the function 𝒜 when one input changes. More formally, the ℓ₁-sensitivity Δ(𝒜) of 𝒜 is defined as

$\begin{matrix}{{\Delta(\mathcal{A})} = {\max\limits_{\substack{\mathcal{X},\mathcal{X}^{\prime} \\ {\parallel {\mathcal{X} - \mathcal{X}^{\prime}} \parallel}_{1} = 1}}{\parallel {{\mathcal{A}(\mathcal{X})} - {\mathcal{A}\left( \mathcal{X}^{\prime} \right)}} \parallel}_{1}}} & (3)\end{matrix}$

where 𝒳 and 𝒳′ are two datasets that differ in one entry.
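For illustration only, the ℓ₁-sensitivity of Eq. 3 can be found by brute force for a toy query over a tiny domain. The histogram query `count_query`, the binary domain, and the dataset size are hypothetical choices made for this sketch.

```python
from itertools import product

def l1_distance(u, v):
    """L1 norm of the difference between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l1_sensitivity(func, domain, size):
    """Largest L1 change of func over all dataset pairs that differ in one entry."""
    worst = 0
    for data in product(domain, repeat=size):
        base = func(list(data))
        for idx in range(size):
            for replacement in domain:
                neighbour = list(data)
                neighbour[idx] = replacement  # change a single record
                worst = max(worst, l1_distance(base, func(neighbour)))
    return worst

def count_query(data):
    """Toy vector-valued query: [number of zeros, number of ones]."""
    return [data.count(0), data.count(1)]

print(l1_sensitivity(count_query, domain=[0, 1], size=3))
```

Changing one record moves one unit of count from one bin to the other, so both coordinates change by one and the worst-case ℓ₁ change is two.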

Framework Details and Construction

Referring again to FIGS. 1 and 2, the details of the double privacy preserving text representation learning framework 100 are discussed herein. This framework 100 includes four major components: 1) an auto-encoder 102 for text representation, 2) a differential-privacy-based noise adder 104, 3) a semantic meaning discriminator 106, and 4) a private attribute discriminator 108. The auto-encoder 102 aims to learn a content representation z 122 of a document 10 by minimizing a reconstruction error between the latent representation z 122 and the text of the document 10. Then, the differential-privacy-based noise adder 104 adds random noise, i.e., Laplacian noise, to the initial latent representation z 122 with respect to privacy budget ϵ to further satisfy the differential privacy guarantee. Since adding noise neither guarantees semantic meaning preservation nor necessarily prevents leakage of private attributes, the semantic meaning and private attribute discriminators 106 and 108 are utilized to infer an optimal amount of added noise s. The semantic meaning discriminator D_S 106 ensures that the added noise does not destroy the semantic meaning with respect to a given task. The private attribute discriminator D_P 108 also guides the amount of added noise by ensuring that the manipulated representation does not include users' private information. Note that it is assumed that the framework 100 is trusted and therefore everything to the left of the privacy barrier (the red dashed line in FIG. 1), including the original textual information and intermediate results, is kept private. The final modified latent representation z̃_i 124, which is to the right of the privacy barrier, is released to the public. The final modified latent representation z̃_i 124 1) is differentially private, 2) obscures private attribute information, and 3) preserves semantic meaning.

Content Representation Extraction

Referring to FIGS. 1 and 2, let x={x¹, . . . , x^m} be a textual document 10 with m words where each word is from a fixed vocabulary set 𝒱 with size |𝒱|=K. The auto-encoder A 102 is used to extract the content representation z 122 from document x 10. Let E_A: 𝒳→𝒵 be an encoder 141 that can infer the latent representation z 122 for a given document x 10, and D_A: 𝒵→𝒳 be a decoder 142 that reconstructs the document 10 from its initial latent representation z 122.

Recurrent neural networks (RNNs) are effective for summarizing and learning semantics of unstructured noisy short texts. In one embodiment, an encoder 141 is built from a first RNN to learn the initial latent representation z 122 of texts. The encoder 141 can learn a probability distribution over a sequence when trained to predict the next symbol in a sequence. The encoder 141 includes a hidden state s and an optional output which operates on a word sequence x={x¹, . . . , x^m}. At each time step t, the hidden state s_t of the encoder 141 is updated by:

$\begin{matrix}{s_{t} = {{f_{enc}\left( {s_{t - 1},x^{t}} \right)}.}} & (4)\end{matrix}$

After reaching the end of the given document 10, the last hidden state of the encoder 141 is used as the latent representation z∈ℝ^{d×1} 122 of the document x 10. A gated recurrent unit (GRU) is used as the cell type to build the encoder 141, which is designed in a manner to have a more persistent memory. Let θ_e denote parameters for the encoder E_A 141. Then:

$\begin{matrix}{z = {E_{A}\left( {x,\theta_{e}} \right)}} & (5)\end{matrix}$
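A minimal sketch of the recurrence of Eq. 4 and the read-out of Eq. 5 follows. It is an assumption-laden toy: a plain element-wise tanh cell with fixed weights `w_s` and `w_x` stands in for the GRU cell described above, and the word vectors are hypothetical inputs rather than learned embeddings.

```python
import math

def tanh_cell(state, x_vec, w_s=0.5, w_x=0.5):
    """Stand-in for f_enc: element-wise tanh recurrence (not a GRU)."""
    return [math.tanh(w_s * s + w_x * x) for s, x in zip(state, x_vec)]

def encode(word_vectors, f_enc=tanh_cell, dim=2):
    """Apply s_t = f_enc(s_{t-1}, x^t) over the sequence and return the last state as z."""
    state = [0.0] * dim
    for x_vec in word_vectors:
        state = f_enc(state, x_vec)
    return state

# Two hypothetical 2-dimensional word vectors standing in for a short document.
z = encode([[1.0, -1.0], [0.5, 0.5]])
print(z)
```

With a tanh-based cell every coordinate of z stays strictly between -1 and 1, which is one way a bounded representation can arise in practice.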

Decoder x̂=D_A(z, θ_d) 142 serves as a check for encoder 141 and takes the initial latent representation z 122 found by encoder 141 as input to start the generation process. θ_d denotes parameters for the decoder D_A 142, which is built using a second RNN. The decoder D_A 142 generates an output word sequence x̂={x̂¹, . . . , x̂^m}. At each time step t, a hidden state of the decoder 142 is computed as:

$\begin{matrix}{s_{t} = {f_{dec}\left( {s_{t - 1},{\hat{x}}^{t}} \right)}} & (6)\end{matrix}$

where s₀=z. The word at step t is predicted using a softmax classifier:

$\begin{matrix}{{\hat{x}}^{t} = {{softmax}\left( {W^{(S)}s_{t}} \right)}} & (7)\end{matrix}$

where softmax(·) is a softmax activation function, W^{(S)}∈ℝ^{|𝒱|×(d+k)} with d+k as the dimension of the hidden state in each layer, and x̂^t∈ℝ^{|𝒱|} is a probability distribution over the vocabulary. Here 𝒱 denotes a fixed vocabulary set with size |𝒱|=K. x̂^{t,j} is defined as the probability of choosing the j-th word v_j∈𝒱 as:

$\begin{matrix}{{\hat{x}}^{t,j} = {p\left( {{\hat{x}}^{t} = v_{j}} \middle| {{\hat{x}}^{t - 1},{\hat{x}}^{t - 2},\ldots,{\hat{x}}^{1}} \right)}} & (8)\end{matrix}$
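By way of a hedged example, the softmax word prediction of Eq. 7 can be sketched for a three-word vocabulary. The weight matrix `W` and the hidden state `s_t` below are made-up numbers chosen only to show the shape of the computation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_distribution(W, s_t):
    """Eq. 7: softmax(W s_t), with one row of W per vocabulary word."""
    logits = [sum(w * s for w, s in zip(row, s_t)) for row in W]
    return softmax(logits)

# Hypothetical 3-word vocabulary and 2-dimensional hidden state.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = word_distribution(W, s_t=[0.2, 0.4])
predicted_index = probs.index(max(probs))  # index j of the most likely word v_j
```

Each entry of `probs` plays the role of x̂^{t,j} in Eq. 8, and the entries sum to one.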

The probability of generating an output sequence x̂={x̂¹, . . . , x̂^m} given the input document x is:

$\begin{matrix}{{p\left( \hat{x} \middle| {x,\theta_{d}} \right)} = {\prod\limits_{t = 1}^{m}{p\left( {\hat{x}}^{t} \middle| {{\hat{x}}^{t - 1},{\hat{x}}^{t - 2},\ldots,{\hat{x}}^{1},z,\theta_{d}} \right)}}} & (9)\end{matrix}$

The encoder 141 and decoder 142 of the auto-encoder 102 of the framework 100 are jointly trained to minimize the negative conditional log-likelihood for all documents. A loss function 143 is defined as:

$\begin{matrix}{\mathcal{L}_{auto} = {- {\sum\limits_{i = 1}^{m}{\log{p\left( {\left. {\overset{\hat{}}{x}}^{i} \middle| x^{i} \right.,\ \theta_{d},\theta_{e}} \right)}}}}} & (10)\end{matrix}$

where θ_e and θ_d are the sets of model parameters for the encoder 141 and decoder 142, respectively. The encoder E_A 141 of the trained auto-encoder 102 is used to obtain the content representation z∈ℝ^{d×1} 122 according to Eq. 5, where d is the size of the textual representation.

Adding Noise

Textual information is rich in content, and publishing this data without proper anonymization can lead to a privacy breach and reveal the identity of an individual. This can let the adversary infer if a targeted user's latent textual representation is in the database or which record is associated with it. Moreover, publishing a document's latent representation could result in leakage of the original text. In fact, recent advancement in adversarial machine learning shows that it is possible to recover the input textual information from its latent representation. In this case, if an adversary has preliminary knowledge of the training model, they can readily reverse engineer the input, for example, by a GAN attack algorithm. It is thus essential to protect the textual information before publishing it.

The goal is thus to add noise to the initial latent representation z 122 such that the differential privacy condition is satisfied. In one embodiment, the initial latent representation z 122 is perturbed at noise adder 104 by adding Laplacian noise as follows:

$\begin{matrix}{{{\hat{z}(i)} = {{z(i)} + {s(i)}}},\quad{{s(i)} \sim {{Lap}(b)}},\quad{b = \frac{\Delta}{\epsilon}},\quad{i = 1},\ldots,d} & (11)\end{matrix}$

where ϵ is the privacy budget, Δ is the L₁-sensitivity of the latent representation z, d is the dimension of z, s is the noise vector, and s(i) and z(i) are the i-th elements of vectors s and z, respectively. Here, Δ=2d. Note that each element of the noise vector is drawn from a Laplacian distribution. The optimal privacy budget ϵ is iteratively found using the semantic meaning discriminator D_S 106 and the private attribute discriminator D_P 108, and the process of adding noise s to the initial latent representation z 122 runs concurrently with finding the optimal privacy budget ϵ until an optimal modified latent representation ẑ 124 is reached.
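The perturbation of Eq. 11 can be sketched as follows, under stated assumptions. The scale uses Δ=2d as given in the text; drawing each Laplacian sample as the difference of two exponential draws is a standard sampling technique chosen for this sketch, not necessarily the one used in the disclosed embodiments, and the helper names are hypothetical.

```python
import random

def laplace_scale(dim, epsilon):
    """Scale b = Delta / epsilon with Delta = 2d, as stated in the text."""
    return (2.0 * dim) / epsilon

def sample_laplace(scale, rng):
    """One Laplace(0, scale) draw, formed as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def perturb(z, epsilon, rng):
    """Eq. 11: add independent Laplacian noise to every coordinate of z."""
    scale = laplace_scale(len(z), epsilon)
    return [z_i + sample_laplace(scale, rng) for z_i in z]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
z_hat = perturb([0.1, -0.2, 0.3], epsilon=1.0, rng=rng)
```

A smaller ϵ gives a larger scale b and therefore heavier noise, which is the privacy-utility trade-off that the two discriminators are used to balance.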

Preserving Semantic Meaning: Semantic Meaning Discriminator

Referring to FIGS. 1 and 3, perturbing the latent representation z 122 of the given text by adding noise to it (Eq. 11) prevents the adversary from reconstructing the text from its latent representation and guarantees differential privacy. However, this approach may destroy the semantic meaning of the text data. Semantic meaning is task-dependent, e.g., classification is one of the common tasks. In order to preserve the semantic meaning of the content representation 122, it is necessary to add an optimal amount of noise to the latent representation 122 which does not destroy the semantic meaning of the text data while ensuring data privacy. The challenge is approached using the semantic discriminator 106 by learning an optimal amount of added noise s, denoted by privacy budget ϵ 125, in terms of training a classifier 161:

$\begin{matrix}{\hat{y} = {{softmax}\left( {\hat{z};\theta_{D_{S}}} \right)}} & (12)\end{matrix}$

where θ_{D_S} 166 are weights associated with the softmax function and ŷ represents an inferred label 164 for classification.

To preserve the semantic meaning of the text representation, a noisy latent representation is needed which retains high utility and accordingly includes enough information for a downstream task, e.g., classification. The classifier 161 of the semantic discriminator D_S 106 aims to assign a correct class label to a modified latent representation ẑ 124, and its loss function 163 is minimized as follows:

$\begin{matrix}{{\min\limits_{\theta_{D_{S}},\epsilon}{\mathcal{L}\left( {\overset{\hat{}}{y},y} \right)}} = {\min\limits_{\theta_{D_{S}},\epsilon}{\sum\limits_{i = 1}^{C}{{- {y(i)}}\log{\overset{\hat{}}{y}(i)}}}}} & (13)\end{matrix}$

where C is the number of classes, and ℒ denotes the cross entropy loss function. A one-hot encoding of a ground truth 162 for the classification task is denoted by y, and y(i) represents the i-th element of y, i.e., the ground truth label for the i-th class.

To learn the value of the privacy budget ϵ 125, a reparameterization process is employed. Instead of directly sampling noise s(i) from a Laplacian distribution (i.e., Eq. 11), this process first samples a value r from a uniform distribution, i.e., r∼[0,1], and then rewrites the amount of added noise s(i) as follows:

$\begin{matrix}{{{s(i)} = {{- \frac{\Delta}{\epsilon}} \times {{sgn}(r)}{\ln\left( {1 - {2{|r|}}} \right)}}},\quad{i = 1},2,\ldots,d} & (14)\end{matrix}$

This is equivalent to sampling noise s from ${Lap}\left( \frac{\Delta}{\epsilon} \right)$.

The advantage of doing so is that the parameter ϵ is now explicitly involved in the representation of the added noise s, which makes it possible to use back-propagation to find the optimal value of ϵ. A large privacy budget ϵ could result in large privacy bounds. Hence, a constraint ϵ≤c₁ is added, where c₁ is a predefined constraint.
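The reparameterization of Eq. 14 can be sketched as below. One assumption is made explicit: in the standard inverse-CDF form of this expression the uniform variable lies in (-1/2, 1/2), so this sketch centres a draw r from [0, 1) as u = r - 1/2 before applying Eq. 14. The function names are hypothetical.

```python
import math
import random

def sgn(v):
    """Sign of v as -1, 0 or 1."""
    return (v > 0) - (v < 0)

def reparam_noise(u, sensitivity, epsilon):
    """Eq. 14 evaluated at a centred uniform value u in (-1/2, 1/2)."""
    return -(sensitivity / epsilon) * sgn(u) * math.log(1.0 - 2.0 * abs(u))

def grad_wrt_epsilon(u, sensitivity, epsilon):
    """ds/d(epsilon) = -s / epsilon, because s is proportional to 1 / epsilon."""
    return -reparam_noise(u, sensitivity, epsilon) / epsilon

def draw_u(rng):
    """Centre r ~ U[0, 1) to u = r - 1/2, redrawing the endpoint where log(0) would occur."""
    u = rng.random() - 0.5
    while u == -0.5:
        u = rng.random() - 0.5
    return u

rng = random.Random(0)
noise = [reparam_noise(draw_u(rng), sensitivity=4.0, epsilon=1.0) for _ in range(2)]
```

Because the randomness sits entirely in u, the noise is a deterministic and differentiable function of ϵ, which is what allows a gradient such as `grad_wrt_epsilon` to flow back to the privacy budget.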

Another challenge here is that ŷ is inferred from ẑ after introducing noise to the initial latent representation z. The noise is also sampled from the Laplacian distribution, which results in large variance in the training process. To solve this issue and make the model more robust, K copies of noise are sampled for each given document. In other words, Eq. 13 can be re-written as follows:

$\begin{matrix}{{\min\limits_{\theta_{D_{S}},\epsilon}{\mathcal{L}_{D_{S}}\left( {\hat{y},y} \right)}} = {{\min\limits_{\theta_{D_{S}},\epsilon}{\frac{1}{K}{\sum\limits_{k = 1}^{K}{\mathcal{L}\left( {{\hat{y}}^{k},y} \right)}}}} = {\min\limits_{\theta_{D_{S}},\epsilon}{\frac{1}{K}{\sum\limits_{k = 1}^{K}{\sum\limits_{i = 1}^{C}{{- {y(i)}}\log{{\hat{y}}^{k}(i)}}}}}}},\quad{{\text{s.t.}\;\epsilon} \leq c_{1}}} & (15)\end{matrix}$

where the goal is to minimize loss function ℒ_{D_S} w.r.t. the parameters {θ_{D_S}, ϵ} and ŷ^k=softmax(ẑ^k; θ_{D_S}). Note that ẑ^k=z+s^k, in which s^k is the k-th sample of the noise calculated with Eq. 14.
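The K-sample averaging of Eq. 15 can be illustrated with a small sketch. The linear-softmax classifier with weight matrix `W`, the two-class labels, and the fixed noise samples are assumptions made only for this example; as noted elsewhere herein, any model can serve as the semantic discriminator.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, y_hat):
    """-sum_i y(i) log y_hat(i) for a one-hot ground truth y."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_hat) if y > 0)

def classify(z_hat, W):
    """Assumed linear-softmax classifier standing in for the semantic discriminator."""
    return softmax([sum(w * v for w, v in zip(row, z_hat)) for row in W])

def averaged_loss(z, y_onehot, W, noise_samples):
    """Eq. 15: mean cross entropy over K noisy copies z + s^k of the same document."""
    losses = []
    for s_k in noise_samples:
        z_hat_k = [a + b for a, b in zip(z, s_k)]
        losses.append(cross_entropy(y_onehot, classify(z_hat_k, W)))
    return sum(losses) / len(losses)

# Hypothetical 2-class problem with K = 2 fixed noise samples.
W = [[1.0, 0.0], [0.0, 1.0]]
loss = averaged_loss([0.0, 0.0], [1, 0], W, [[1.0, 0.0], [-1.0, 0.0]])
```

Averaging over several noise draws lowers the variance of the loss estimate seen by the optimizer compared with using a single draw per document.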

Following minimization and resultant determination of a privacy budget ϵ 125, an error 126 is computed between the predicted label ŷ 164 and the ground truth label y 162.

Private Attribute Discriminator and Privacy Preservation

Referring to FIGS. 1 and 4, the disclosure further addresses how adding noise s to the latent representation z 122 of the text can prevent adversaries from learning the input textual information and guarantee differential privacy. Another important aspect of learning a privacy preserving text representation is to ensure that sensitive and private information of the users such as age, gender, and location is not captured in the latent representation.

An adversary cannot design a private attribute inference attack better than what it has already anticipated. In this spirit, the idea of adversarial learning is leveraged. In particular, it is necessary to train the private attribute discriminator D_P 108 to accurately identify the private information from the latent representation z 122, while learning the modified latent representation z̃ 124 that can fool the discriminator and minimize leakage of private attributes, which results in a representation that does not contain sensitive information. Private attribute discriminator 108 uses a classifier 181 to attempt to predict a private attribute label 184 using a ground truth label 182. Ultimately, a goal of private attribute discriminator 108 would be to find parameters that would prevent any classifier such as classifier 181 from accurately predicting private attribute labels. Assume that there are T private attributes (e.g., age, gender, location). Let p_t represent the ground truth 182 (i.e., correct label) for the t-th sensitive attribute and θ_{D_P^t} denote parameters 186 of discriminator model D_P 108 for the t-th sensitive attribute. The adversarial learning can be formally written as:

$\begin{matrix}{{{\min\limits_{\{\theta_{D_{P}^{t}}\}_{t = 1}^{T}}{\max\limits_{\epsilon}\mathcal{L}_{D_{P}}}} = {\min\limits_{\{\theta_{D_{P}^{t}}\}_{t = 1}^{T}}{\max\limits_{\epsilon}{\frac{1}{K \cdot T}{\sum\limits_{t = 1}^{T}{\sum\limits_{k = 1}^{K}{\mathcal{L}_{D_{P}^{t}}\left( {{\hat{p}}_{t}^{k},p_{t}} \right)}}}}}}},\quad{{\text{s.t.}\;\epsilon} \leq c_{1}}} & (16)\end{matrix}$

where ℒ_{D_P^t} denotes a cross entropy loss function and p̂_t^k=softmax(z̃^k, θ_{D_P^t}) is the predicted t-th sensitive attribute label 184 using the k-th sample. The outer minimization 183 finds the strongest private attribute inference attack and the inner maximization 185 seeks to fool the discriminator by obscuring private information. In other words, the outer minimization 183 seeks convergence of the discriminator parameters 186 while the inner maximization 185 seeks to find the privacy budget value ϵ 125 that maximizes a loss between a predicted label 184 of a private attribute and an actual ground truth label 182. The private attribute discriminator 108 finds parameters θ_{D_P^t} 186 and a privacy budget value ϵ 125 that cause the classifier 181 to fail to classify private attributes. Following maximization and resultant determination of a privacy budget ϵ 125, an error 187 is computed based on the predicted private attribute label p̂ 184 and the ground truth value 182 of the private attribute p.

Optimization Function

In the previous sections, it was discussed how to: (1) add noise toprevent the adversary from reconstructing the original text from thelatent representation and minimize the chance of privacy breach bysatisfying differential privacy (Eq. 11), (2) control the amount of theadded noise to preserve the semantic meaning of the textual informationfor a given task (Eq. 15), and (3) control the amount of the added noiseso that user's private information is masked (Eq. 16). Inspired by theidea of adversarial learning, all three are achieved at once by modelingthe objective function as a minmax game among the semantic meaningdiscriminator D_(S) 106 and the private attribute discriminator D_(P)108, as follows:

$\begin{matrix}{{{{\min\limits_{\theta_{D_{S}},\epsilon}{\max\limits_{{\{\theta_{D_{P}^{t}}\}}_{t = 1}^{T}}\mathcal{L}_{D_{S}}}} - {\alpha\mathcal{L}}_{D_{P}}} = {\min\limits_{\theta_{D_{S}},\epsilon}{\max\limits_{{\{\theta_{D_{P}^{t}}\}}_{t = 1}^{T}}{\frac{1}{K}{\sum\limits_{k = 1}^{K}\left\lbrack {{\mathcal{L}\left( {{\overset{\hat{}}{y}}^{k},y} \right)} - {\alpha\frac{1}{T}{\sum\limits_{t = 1}^{T}{\mathcal{L}_{D_{P}^{t}}\left( {{\overset{\hat{}}{p}}_{t}^{k},p_{t}} \right)}}}} \right\rbrack}}}}},{{s.t.\mspace{14mu}\epsilon} \leq c_{1}}} & (17)\end{matrix}$

where α controls the contribution of the private attribute discriminatorin the learning process. This objective function seeks to minimizeprivacy leakage with respect to the attack, minimize loss in thesemantic meaning of the textual representation, and protect privateinformation. With N documents, Eq. 13 is written as follows:

$\begin{matrix}{{{\min\limits_{\theta_{D_{S}},\epsilon}{\max\limits_{{\{\theta_{D_{P}^{t}}\}}_{t = 1}^{T}}{\frac{1}{N}{\sum\limits_{n = 1}^{N}\left\lbrack {\frac{1}{K}{\sum\limits_{k = 1}^{K}\left\lbrack {{\mathcal{L}\left( {{\overset{\hat{}}{y}}_{n}^{k},y_{n}} \right)} - {\alpha\frac{1}{T}{\sum\limits_{t = 1}^{T}{\mathcal{L}_{D_{P}^{t}}\left( {{\overset{\hat{}}{p}}_{n}^{k},p_{n,t}} \right)}}}} \right\rbrack}} \right\rbrack}}}} + {\lambda{\Omega(\theta)}}},{{s.t.\mspace{14mu}\epsilon} \leq c_{1}}} & (18)\end{matrix}$

where θ={θ_(D_(S)), ∈, {θ_(D_(P)^(t))}_(t=1)^(T)} is the set of all parameters to be learned, Ω(θ) is a regularizer for the parameters, such as the Frobenius norm, and λ is a scalar that controls the contribution of the regularization term Ω(θ).

The aim of this objective function is to perturb the original text representation by adding a proper amount of noise to it in order to prevent an adversary from inferring the existence of the target textual representation in the database, reconstructing the user's original text, and learning the user's sensitive information from the latent representation, while preserving the semantic meaning of the modified representation for a given specific task. It is stressed that the resultant text representation satisfies {tilde over (∈)}-differential privacy, where {tilde over (∈)}≤c₁ is the optimal learned privacy budget. This is further discussed below.
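As a minimal sketch of how the bracketed term of Eq. 17 could be evaluated for a single document, the following Python snippet averages, over K noise draws, the semantic loss minus α times the mean private attribute loss. The function name `combined_objective` and the list-based inputs are illustrative assumptions and are not part of the disclosed framework; the per-draw loss values are assumed to have been produced elsewhere by D_(S) and the T private attribute discriminators.

```python
def combined_objective(semantic_losses, private_losses, alpha):
    """Evaluate (1/K) * sum_k [ L_S^k - alpha * (1/T) * sum_t L_P^{t,k} ].

    semantic_losses: list of K semantic-task losses, one per noise draw.
    private_losses:  list of K lists, each holding T private-attribute losses.
    alpha:           weight of the private attribute discriminators.
    All names here are hypothetical and chosen for illustration only.
    """
    if len(semantic_losses) != len(private_losses):
        raise ValueError("expected one list of private losses per noise draw")
    total = 0.0
    for sem, priv in zip(semantic_losses, private_losses):
        mean_private = sum(priv) / len(priv)  # (1/T) * sum over attributes
        total += sem - alpha * mean_private
    return total / len(semantic_losses)       # (1/K) * sum over noise draws


# Two noise draws (K=2) and two private attributes (T=2).
value = combined_objective([1.0, 3.0], [[0.5, 1.5], [1.0, 1.0]], alpha=1.0)
```

Under Eq. 17, θ_(D_(S)) and ∈ would be updated to decrease this value, while the private attribute discriminator parameters would be updated to increase it, which corresponds to lowering their own prediction losses.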

Algorithm 1: The Learning Process of the DPText Model
Input: Training data χ, θ_(D_(S)), ∈, {θ_(D_(P)^(t))}_(t=1)^(T), batch size b, c₁ and α.
Output: The privacy preserving learned text representation {tilde over (z)}
1: Pre-train the document auto-encoder E_(A) to obtain the text representations according to Eq. 5 as z = E_(A)(x, θ_(e))
2: repeat
3: Sample a mini-batch of b samples {x^(i)}_(i=1)^(b) from χ
4: Add noise s to the initial document representation z_(i) and get the new document representation {tilde over (z)}_(i), i = 1, 2, ..., b via Eq. 14
5: Train semantic discriminator D_(S) by gradient descent (Eq. 15)
6: Train private attribute discriminator D_(P) via Eq. 16
7: until convergence

The optimization process is illustrated in Algorithm 1 and FIGS. 1-5. Block 210 of process flow 200 shows obtaining θ_(D_(S)), ∈, {θ_(D_(P)^(t))}_(t=1)^(T) and training data χ. First, the initial latent representation 122 of all documents, {z_(1), . . . , z_(N)}, is obtained in Line 1, as further illustrated in FIGS. 1 and 2 and as shown in block 220 of FIG. 5. Then, as illustrated in Lines 2-7, shown in block 230 of FIG. 5 and elaborated on in FIG. 6, noise s is added to the initial latent text representation z_(i) 122 to obtain a new modified latent representation {tilde over (z)}_(i) 124. The noise s is iteratively optimized to retain semantic meaning using semantic discriminator D_(S) 106 while preventing recovery of private attributes using private attribute discriminator D_(P) 108. In particular, as shown in block 232 of FIG. 6, a mini-batch of b samples is drawn from the training data and, in block 234, noise s is added to the initial text representation 122. Next, the semantic discriminator D_(S) 106 is trained in Line 5 at block 236 and private attribute discriminator D_(P) 108 is trained in Line 6 at block 238. Recall that there is a constraint on the variable ∈, i.e., ∈≤c₁. To satisfy this constraint, projected gradient descent is used: a gradient descent step is first performed, i.e., ∈ is updated to ∈−γ×∇∈, where γ is the learning rate and ∇∈ is the gradient of the objective with respect to ∈. Then, the parameter ∈ is projected back onto the constraint set. This means that if ∈>c₁, then ∈ is set to c₁; otherwise, the value of ∈ is kept. The modified latent representation {tilde over (z)}_(i) 124 can then be calculated for each given document 10 according to the value of the optimal learned privacy budget {tilde over (∈)}≤c₁ using Eq. 11. Note that any model can be used for semantic discriminator D_(S) 106 and private attribute discriminator D_(P) 108.
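The projected gradient descent update on the privacy budget can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `projected_epsilon_step` is hypothetical, the gradient is assumed to be supplied by whatever training framework is used, and the small positive floor `eps_min` is an added safeguard (not stated in the source) that keeps the Laplace scale Δ/∈ finite.

```python
def projected_epsilon_step(eps, grad, lr, c1, eps_min=1e-6):
    """One projected gradient descent step for the privacy budget.

    eps:     current value of the privacy budget variable.
    grad:    gradient of the objective with respect to eps (assumed given).
    lr:      learning rate (gamma in the text).
    c1:      upper-bound constraint on eps.
    eps_min: hypothetical lower floor, an assumption of this sketch.
    """
    eps = eps - lr * grad          # plain gradient descent step
    if eps > c1:                   # projection onto the constraint eps <= c1
        eps = c1
    if eps < eps_min:              # assumed floor so that eps stays positive
        eps = eps_min
    return eps


# A step that would overshoot the bound is clipped back to c1.
clipped = projected_epsilon_step(eps=0.09, grad=-1.0, lr=0.05, c1=0.1)
# A step that stays inside the feasible region is left unchanged.
inside = projected_epsilon_step(eps=0.09, grad=1.0, lr=0.05, c1=0.1)
```

Because the feasible set is an interval, the projection reduces to clipping, which is why no further optimization is needed after the gradient step.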

Theoretical Analysis

Here, it is shown that the text representation learned using DPText satisfies {tilde over (∈)}-differential privacy, where {tilde over (∈)}≤c₁ is the learned optimal privacy budget. In particular, the privacy guarantee for the final noisy latent representation {tilde over (z)}_(i) for each given document is proven. The theoretical findings confirm that DPText minimizes the chance of revealing the existence of textual representations in the database.

Theorem 1. Let {tilde over (∈)}≤c₁ be the optimal value learned for the privacy budget variable ∈ with respect to the semantic meaning and private attribute discriminators. Let z_(i) be the original latent representation for document x_(i), i=1, . . . , N, inferred using Eq. 5. Moreover, let Δ denote the L₁-sensitivity of the textual latent representation extractor function discussed herein. If each element s_(i)(l), l=1, . . . , d in noise vector s_(i) is selected randomly from

${{{Lap}\left( \frac{\Delta}{\overset{\sim}{\epsilon}} \right)}\left( {\Delta = {2d}} \right)},$

the final noisy latent representation {tilde over (z)}_(i)=z_(i)+s_(i) satisfies {tilde over (∈)}-differential privacy.

Proof. First, the change of z is bounded when one data point in the database changes. This gives the L₁-sensitivity of the textual latent representation extractor function discussed above.

Recall the way z is calculated using Eq. 5. The tanh function is used in the GRU to build the RNN which is used above to find the latent representation of a given document. The output of the tanh function is within the range [−1,1]. This indicates that the value of each element z(l), l=1, . . . , d in the latent representation vector z is within the range [−1,1]. If one data point changes (e.g., is removed from the database), the maximum change in the value of each element z(l) is 2. Since the dimension of z is d, the maximum change in the L₁ norm of z happens when all of its elements z(l) have the maximum change. According to Definition 2, the L₁-sensitivity of z is Δ=2×d.
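To make the sensitivity argument concrete, the sketch below computes Δ=2d and draws element-wise Laplace noise with scale Δ/∈ for a latent vector whose entries lie in [−1,1]. It is an illustrative sketch only: the function names are hypothetical, only the Python standard library is used, and the Laplace sample is generated as the scaled difference of two independent unit-rate exponential draws, which is a standard way to sample from a zero-mean Laplace distribution.

```python
import math
import random


def l1_sensitivity(d):
    """L1-sensitivity of a d-dimensional vector with entries in [-1, 1]."""
    return 2.0 * d


def laplace_scale(d, eps):
    """Scale parameter Delta / eps of the Laplace noise."""
    return l1_sensitivity(d) / eps


def perturb(z, eps, rng):
    """Add independent Lap(Delta / eps) noise to every element of z."""
    scale = laplace_scale(len(z), eps)
    noisy = []
    for value in z:
        # Difference of two Exp(1) draws, times scale, is Laplace(0, scale).
        noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        noisy.append(value + noise)
    return noisy


rng = random.Random(0)                      # fixed seed for repeatability
z = [math.tanh(0.1 * i) for i in range(64)]  # toy latent vector, d = 64
z_tilde = perturb(z, eps=0.1, rng=rng)
```

With d=64 and ∈=0.1, the values used in the experiments below, the scale Δ/∈ evaluates to 1280, which illustrates how a small privacy budget translates into strong noise relative to entries bounded by 1.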

Now, assume that {tilde over (∈)}≤c₁ is the optimal value for the learned privacy budget. Then each element in s (i.e., s(l), l=1, 2, . . . , d) is distributed as

${Lap}\left( \frac{\Delta}{\overset{\sim}{\epsilon}} \right)$

based on Eq. 11 which is equal to randomly picking each s(l) from the

${Lap}\left( \frac{\Delta}{\overset{\sim}{\epsilon}} \right)$

distribution, whose probability density function is

$\begin{matrix}{{\Pr\left( {s(l)} \right)} = {\frac{\overset{\sim}{\epsilon}}{2\Delta}e^{- \frac{\overset{\sim}{\epsilon}{\left| {s(l)} \right|}}{\Delta}}.}} & \;\end{matrix}$

Let $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ be any two datasets that differ only in the value of one record. Without loss of generality, it is assumed that the representation of the last document is changed from z_(n) to z_(n)′. Since the L₁-sensitivity of z is Δ=2d, ∥z_(n)−z_(n)′∥₁≤Δ. Then:

$\begin{matrix}{\frac{\Pr\left\lbrack {{z_{n} + s_{n}} = \left. r \middle| \mathcal{D}_{1} \right.} \right\rbrack}{\Pr\left\lbrack {{z_{n}^{\prime} + s_{n}^{\prime}} = \left. r \middle| \mathcal{D}_{2} \right.} \right\rbrack} = {\frac{\prod_{l \in {\{{1,2,\ldots\mspace{14mu},d}\}}}{\Pr\left( {{r(l)} - {z_{n}(l)}} \right)}}{\prod_{l \in {\{{1,2,\ldots\mspace{14mu},d}\}}}{\Pr\left( {{r(l)} - {z_{n}^{\prime}(l)}} \right)}} = {\frac{\prod_{l \in {\{{1,2,\ldots\mspace{14mu},d}\}}}{\Pr\left( {s_{n}(l)} \right)}}{\prod_{l \in {\{{1,2,\ldots\mspace{14mu},d}\}}}{\Pr\left( {s_{n}^{\prime}(l)} \right)}} = {\frac{e^{- \frac{\overset{\sim}{\epsilon}{\sum_{l}{\left| {s_{n}(l)} \right|}}}{\Delta}}}{e^{- \frac{\overset{\sim}{\epsilon}{\sum_{l}{\left| {s_{n}^{\prime}(l)} \right|}}}{\Delta}}} = {e^{\frac{\overset{\sim}{\epsilon}{\sum_{l}{\left( {{\left| {s_{n}^{\prime}(l)} \right|} - {\left| {s_{n}(l)} \right|}} \right)}}}{\Delta}} \leq e^{\frac{\overset{\sim}{\epsilon}{\sum_{l}{\left| {{s_{n}^{\prime}(l)} - {s_{n}(l)}} \right|}}}{\Delta}} = e^{\frac{\overset{\sim}{\epsilon}{{\| {s_{n}^{\prime} - s_{n}} \|}_{1}}}{\Delta}}}}}}} & (19)\end{matrix}$

where s_(n) and s_(n)′ are the corresponding noise vectors with respect to the learned {tilde over (∈)} when the inputs are $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$, respectively. The inequality follows from the triangle inequality, i.e., |a|−|b|≤|a−b|. The last equality follows from the definition of the L₁-norm.

Since s_(n)=r−z_(n) and s_(n)′=r−z_(n)′ then:

∥s _(n) ′−s _(n)∥₁=∥(r−z _(n)′)−(r−z _(n))∥₁ =∥z _(n) ′−z _(n)∥₁≤Δ  (20)

This follows from the definition of L₁-sensitivity. Eq. 19 is then rewritten as:

$\begin{matrix}{\frac{\Pr\left\lbrack {{z_{n} + s_{n}} = \left. r \middle| \mathcal{D}_{1} \right.} \right\rbrack}{\Pr\left\lbrack {{z_{n}^{\prime} + s_{n}^{\prime}} = \left. r \middle| \mathcal{D}_{2} \right.} \right\rbrack} \leq e^{\frac{\overset{\sim}{\epsilon}{{\| {s_{n}^{\prime} - s_{n}} \|}_{1}}}{\Delta}} \leq e^{\frac{\overset{\sim}{\epsilon}\Delta}{\Delta}} = e^{\overset{\sim}{\epsilon}}} & (21)\end{matrix}$

So, the theorem follows and the final noisy latent representation is {tilde over (∈)}-differentially private.
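The bound in the proof can be checked numerically. The following sketch evaluates the logarithm of the density ratio on the left-hand side of Eq. 21 for element-wise Laplace noise with scale Δ/∈ and Δ=2d, and compares it with ∈. The function names and the toy vectors are assumptions made for illustration; the check is a sanity test of the inequality, not a substitute for the proof.

```python
import math
import random


def laplace_log_pdf(s, scale):
    """Log density of a zero-mean Laplace distribution at s."""
    return -math.log(2.0 * scale) - abs(s) / scale


def log_density_ratio(r, z, z_prime, eps):
    """log( Pr[z + s = r] / Pr[z' + s' = r] ) for element-wise Laplace noise."""
    scale = (2.0 * len(z)) / eps          # Delta / eps with Delta = 2d
    numerator = sum(laplace_log_pdf(ri - zi, scale) for ri, zi in zip(r, z))
    denominator = sum(laplace_log_pdf(ri - zi, scale) for ri, zi in zip(r, z_prime))
    return numerator - denominator


eps = 0.1
z = [1.0, 1.0, 1.0, 1.0]            # entries at the upper end of [-1, 1]
z_prime = [-1.0, -1.0, -1.0, -1.0]  # worst case: every entry changes by 2

# Outputs r beyond z make the bound tight: the log-ratio equals eps.
tight = log_density_ratio([2.0, 2.0, 2.0, 2.0], z, z_prime, eps)

# For arbitrary outputs r the absolute log-ratio never exceeds eps.
rng = random.Random(1)
worst = max(
    abs(log_density_ratio([rng.uniform(-5.0, 5.0) for _ in z], z, z_prime, eps))
    for _ in range(1000)
)
```

The tight case corresponds to all coordinates changing by the maximum amount of 2, which is exactly the situation used to derive Δ=2d.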

Experimental Results

In this section, experiments are conducted on real-world data to demonstrate the effectiveness of DPText in terms of preserving both the privacy of users and the utility of the resultant representation for a given task. Specifically, this section aims to answer the following questions:

Q1—Utility: Does the learned text representation preserve the semantic meaning of the original text for a given task?

Q2—Privacy: Does the learned text representation obscure users' private information?

Q3—Utility-Privacy Relation: Does the improvement in privacy of the learned text representation result in sacrificing utility?

To answer the first question (Q1), experimental results for DPText are reported with respect to two well-known text-related tasks, i.e., sentiment analysis and part-of-speech (POS) tagging. Sentiment analysis and POS tagging have many applications in Web and user-behavioral modeling. Recent research showed how linguistic features such as sentiment are highly correlated with users' demographic information. Another group of research shows the effectiveness of POS tags in predicting users' age and gender information. This makes users vulnerable to inference of their private information. Therefore, to answer the second question (Q2), different types of private information, i.e., age, location, and gender, are considered, and results for the private attribute prediction task are reported. To answer the third question (Q3), the utility loss is investigated against the privacy improvement of the learned text representation.

Data. A dataset from TrustPilot is used. On TrustPilot, users can write reviews and leave a one to five star rating. Users can also provide some demographic information. In the collected dataset, each review is associated with three attributes: gender (male/female), age, and location (Denmark, Germany, France, United Kingdom, and United States). First, all non-English reviews, as identified by LANGID.PY, are discarded, and only reviews classified as English with a confidence greater than 0.9 are kept. The age attribute is categorized into three groups: over-45, under-35, and between 35 and 45. For each location, 10,000 reviews are subsampled to balance the five locations. Each review's rating score is considered as the target sentiment class.
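The age bucketing and location balancing described above might be sketched as follows. This is an illustration under stated assumptions: the record layout (dictionaries with `age` and `location` keys), the function names, and the treatment of the boundary ages 35 and 45 as belonging to the middle group are all hypothetical, since the source does not specify them. The language filtering step is omitted.

```python
import random
from collections import defaultdict


def age_group(age):
    """Map an age to one of the three groups (boundary handling assumed)."""
    if age < 35:
        return "under-35"
    if age <= 45:
        return "35-45"
    return "over-45"


def subsample_per_location(reviews, per_location, seed=0):
    """Keep at most `per_location` randomly chosen reviews for each location."""
    rng = random.Random(seed)
    by_location = defaultdict(list)
    for review in reviews:
        by_location[review["location"]].append(review)
    balanced = []
    for location in sorted(by_location):   # sorted for a repeatable order
        group = by_location[location]
        balanced.extend(rng.sample(group, min(per_location, len(group))))
    return balanced


# Toy data: 5 reviews from one location and 3 from another.
toy = [{"location": "US", "age": 30 + i} for i in range(5)]
toy += [{"location": "DK", "age": 50 + i} for i in range(3)]
balanced = subsample_per_location(toy, per_location=3)
groups = [age_group(review["age"]) for review in balanced]
```

In the setting described above, `per_location` would be 10,000 and the five locations would be Denmark, Germany, France, United Kingdom, and United States.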

Model and Parameter Settings. For the document auto-encoder A, a single-layer RNN is used with a GRU cell whose input/hidden dimension is d=64. For the semantic and private attribute discriminators, feed-forward networks are used with a single hidden layer, a hidden state dimension of 200 (determined through grid search), and a sigmoid output layer. The parameters α and λ are determined through cross-validation, and are set as α=1 and λ=0.01. The upper-bound constraint c₁ for the value of parameter ∈ is set as c₁=0.1 to ensure ∈-differential privacy with ∈=0.1 for the learned representation.

Part of Speech Tagging

Part-of-speech (POS) tagging is another language processing application which is framed as a sequence tagging problem.

Data. For this task, a manually POS-tagged version of the TrustPilot dataset in English is used. This data includes 600 sentences, each tagged with POS information based on the Google Universal POS tagset and also labeled with both the gender and age of the users. The gender attribute is categorized into male and female, and the age attribute is categorized into two groups, over-45 and under-35. The Web English Treebank (WebEng) is used to pre-train the tagging model because of the small quantity of text available for this task. WebEng is similar to the TrustPilot dataset with respect to domain, as both contain unedited user-generated textual data.

Model and Parameter Settings. Similar to the sentiment analysis task, a single-layer RNN is used with a GRU cell whose input/hidden dimension is d=64 for document auto-encoder A 104. For semantic discriminator 106 (i.e., the POS tag predictor), a bi-directional long short-term memory network is used:

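$\begin{matrix}{{h_{i} = {{LSTM}\left( {x^{i},h_{i - 1};\theta_{h}} \right)}},\quad{h_{i}^{\prime} = {{LSTM}\left( {x^{i},h_{i + 1}^{\prime};\theta_{h}^{\prime}} \right)}},\quad{y_{i} = {{Categorical}\left( {{\phi\left( \left\lbrack {h_{i};h_{i}^{\prime}} \right\rbrack \right)};\theta_{0}} \right)}}} & (22)\end{matrix}$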

where x^(i)|_(i=1)^(m) is the input sequence with m words, h_(i) is the i-th hidden state, h₀ and h_(m+1)′ are terminal hidden states set to zero, [.;.] denotes vector concatenation and ϕ is a linear transformation. The dimension of the hidden layer is set as 200. A dropout rate of 0.5 is applied to all hidden layers during training.

For the private attribute discriminator 108, feed-forward networks are used with a single hidden layer, a hidden state dimension of 200 (determined via grid search), and a sigmoid output layer. The input to this network is the final hidden representation [h_(m); h₀′]. For hyperparameters, the values of α and λ are set as α=1 and λ=0.01, which are determined through cross-validation. The upper-bound constraint for the value of ∈ is also set as c₁=0.1.

Experimental Design

Ten-fold cross validation was performed for the POS tagging and sentiment analysis tasks. Following state-of-the-art research, the accuracy score is reported to evaluate the utility of the generated data for the given POS tagging or sentiment analysis task. In particular, for the sentiment prediction task, accuracy is reported for correctly predicting the rating of reviews. Tagging accuracy is reported for the POS tagging task. To examine the text representation in terms of obscuring private attributes, test performance is reported in terms of F1 score for predicting private attributes. Note that the private attributes for the sentiment task include age, gender and location, while the private attributes for the tagging task include gender and age.

DPText is compared in both tasks with the following baselines:

ORIGINAL: This is a variant of DPText that publishes the original representation z 122 without adding noise or utilizing D_(S) discriminator 106 or D_(P) discriminator 108.

DIFPRIV: This baseline adds Laplacian noise to the original representation z 122 according to Eq. 11

$\left( {{i.e.},{{Lap}\left( \frac{\Delta}{\epsilon} \right)},{\epsilon = {0.1}},{\Delta = {2d}}} \right)$

without utilizing D_(S) and D_(P) discriminators 106 and 108. Note that this method makes the final representation ∈-differentially private. The model was compared against this method to investigate the effectiveness of the semantic and private attribute discriminators 106 and 108.

ADV-ALL: This method utilizes the idea of adversarial learning and has two components, a generator and a discriminator. It generates a text representation that has high quality for the given task but poor quality for inference of private attributes. The model was compared against this method to see how well adding an optimal amount of noise can preserve privacy in practice.

In both tasks, semantic discriminator D_(S) 106 is trained on the training data and applied to test data for predicting sentiment and POS tags. Similarly, private attribute discriminator D_(P) 108 can be applied where it plays the role of an adversary trying to infer the private attributes of the user based on the textual representation. Private attribute discriminator D_(P) 108 is also trained on the training data and applied to test data for evaluation. A higher accuracy score for semantic discriminator D_(S) 106 indicates that the representation has high utility for the given task, while a lower F1 score for private attribute discriminator D_(P) 108 demonstrates that the textual representation provides higher privacy for individuals by obscuring their private information.
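The two evaluation measures can be sketched in a few lines. The snippet below is a minimal illustration with hypothetical function names; in particular, the source does not state how F1 is averaged across classes, so the macro average used here is an assumption of this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy adversary predictions for a binary private attribute.
truth = ["m", "m", "f", "f"]
guess = ["m", "f", "f", "f"]
utility_style_score = accuracy(truth, guess)
privacy_style_score = macro_f1(truth, guess)
```

Under the evaluation protocol above, accuracy would be computed on the outputs of D_(S) (higher is better for utility) and F1 on the outputs of D_(P) (lower is better for privacy).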

Experimental Results

Performance Comparison. For evaluating the quality of the learned text representation, questions Q1, Q2 and Q3 are answered for two different natural language processing tasks, i.e., sentiment prediction and POS tagging. The experimental results for the different methods are shown in Table 1.

TABLE 1
Accuracy for sentiment prediction and POS tagging and F1 for evaluating private attribute prediction task.

(a) Sentiment Prediction Task
Model      Sentiment (Acc)   Age (F1)   Loc (F1)   Gen (F1)
ORIGINAL   0.7493            0.3449     0.1539     0.5301
DIFPRIV    0.7397            0.3177     0.1411     0.5118
ADV-ALL    0.7165            0.3076     0.1080     0.4716
DPTEXT     0.7318            0.1994     0.0581     0.3911

(b) POS Tagging Task
Model      POS Tagging (Acc)   Age (F1)   Gen (F1)
ORIGINAL   0.8913              0.4018     0.5627
DIFPRIV    0.8982              0.3911     0.5417
ADV-ALL    0.8901              0.3514     0.5008
DPTEXT     0.9257              0.2218     0.3865

Utility (Q1):

Sentiment Prediction Task. The sentiment prediction results for DPText are comparable to those of the ORIGINAL approach. This means that the representation produced by DPText preserves the semantic meaning of the textual representation according to the given task (i.e., high utility). DIFPRIV performs slightly better than DPText in preserving the semantic meaning of the text. The reason is that DPText applies noise at least as strong as DIFPRIV (or even stronger), and adding more noise results in a bigger utility loss. Despite adding more noise than DIFPRIV, the accuracy of DPText is still comparable with that of DIFPRIV. This confirms the role of semantic meaning discriminator D_(S) in preserving utility and semantic meaning, as it explicitly takes utility into consideration when adding noise. It is also observed that DPText has better performance in terms of predicting sentiment in comparison to ADV-ALL. DPText is different from ADV-ALL as it manipulates the original text representation by adding noise to it, while ADV-ALL generates a privacy preserving text representation from scratch. The benefit of DPText over ADV-ALL is two-fold. First, the framework does not depend on the process which generates the original representation. In other words, this representation could be generated via any model such as doc2vec. Second, adding Laplacian noise to the text representation prevents an adversary from learning the original input text through reverse engineering by a GAN attack algorithm and also minimizes re-identification of users by guaranteeing ∈-differential privacy.

POS Tagging Task. The accuracy of the POS tagging task is higher when DPText is utilized than when ORIGINAL is used. This is because POS tagging results are biased toward gender, age and location. In other words, this information affects the performance of the tagging task. Removing private information from the latent representation results in removing this type of bias for the tagging task. Therefore, the learned representation is more robust and results in more accurate tagging. DPText also has better performance than DIFPRIV due to the removal of private information and thus bias. Besides, the results demonstrate that DPText outperforms ADV-ALL. These results indicate the effectiveness of DPText in preserving the semantic meaning of the learned text representation.

Privacy (Q2):

Sentiment Prediction Task. In the sentiment prediction task, DPText has a significantly lower F1 score for inferring all three private attributes in comparison to ORIGINAL. This shows that DPText outputs text representations that outperform ORIGINAL in terms of obscuring private information. Moreover, it is also observed that DPText has significantly better performance in hiding private information than DIFPRIV. This indicates that solely adding noise and satisfying ∈-differential privacy does not protect textual information against other types of attacks and leakage of users' private attributes. This further demonstrates the importance of private attribute discriminator D_(P) in obscuring users' private information. It is also observed that the textual representation learned via DPText hides more private information than ADV-ALL (lower F1 score). These results indicate that DPText can successfully obscure private information.

POS Tagging Task. In the POS tagging task, the F1 scores of DPText for predicting the gender and age private attributes are significantly lower than those of the ORIGINAL approach. These results demonstrate the effectiveness of DPText in obscuring users' private attributes. Similarly, comparing the F1 scores of DPText and DIFPRIV shows that the final text representation output by DPText contains less private attribute information. This confirms the inability of DIFPRIV to obscure users' private information, and clearly shows the effectiveness of private attribute discriminator D_(P). This confirms that satisfying differential privacy does not necessarily protect against other types of attacks such as leakage of users' private attributes. Moreover, DPText outperforms the ADV-ALL method in terms of hiding users' age and gender information. This confirms that the textual latent representation learned by DPText preserves users' privacy by eliminating their sensitive information with respect to the POS tagging task.

Utility-Privacy Relation (Q3):

Sentiment Prediction Task. For the sentiment prediction task, ORIGINAL achieves the highest accuracy in Table 1 and thus the highest utility in comparison to the other methods, and DPText has comparable utility results to ORIGINAL. However, ORIGINAL utility is preserved at the expense of significant privacy loss. In other words, ORIGINAL is not able to obscure users' private attribute information. Moreover, although DIFPRIV satisfies differential privacy and its performance is comparable with DPText for predicting sentiment, it performs poorly in obscuring private information. DIFPRIV may provide a weaker privacy guarantee compared with DPText since the learned ∈ in DPText can be smaller than ∈=0.1 in DIFPRIV. In contrast, DPText has the best results in terms of privacy compared to the other approaches and also incurs less utility loss than ADV-ALL. These results show that DPText not only protects users' privacy with respect to two different types of attacks, but also preserves the semantic meaning of the given text with respect to the task at hand.

POS Tagging Task. For the POS tagging task, the resultant representation from DPText achieves the highest utility in comparison to all other baselines. It also provides more accurate tagging than the ORIGINAL approach as it removes the bias from the textual representation by obscuring age and gender attribute information. Moreover, DPText has the lowest F1 scores for predicting age and gender attributes amongst all approaches, meaning that it performs the best in obscuring users' private attribute information. These results show the effectiveness of DPText in preserving semantic meaning and obscuring private information for more accurate tagging.

The results for the two natural language processing tasks indicate that DPText learns a textual representation that (1) does not contain private information, (2) guarantees differential privacy and thus protects users against leakage of their identity, and (3) preserves the semantic meaning of the representation for the given task.

Impact of Different Components. In this subsection, the impact of different private attribute discriminators on obscuring users' private information is investigated. To achieve this goal, three variants of the disclosed framework are explored, i.e., DPTEXTAGE, DPTEXTGEN, and DPTEXTLOC. In each of these variants, the model is trained with the discriminator of just one of the private attributes. For example, DPTEXTAGE is trained solely with the age discriminator and does not use any other private attribute discriminators during the training phase. The performance comparisons for both the sentiment prediction and POS tagging tasks are shown in Table 2.

TABLE 2
Impact of different private attribute discriminators on DPText for sentiment prediction and POS tagging tasks.

(a) Sentiment Prediction Task
Model       Sentiment (Acc)   Age (F1)   Loc (F1)   Gen (F1)
DPTEXT      0.7318            0.1994     0.0581     0.3911
DPTEXTAGE   0.7573            0.2248     0.1012     0.3982
DPTEXTLOC   0.7360            0.2861     0.0731     0.4100
DPTEXTGEN   0.7347            0.2997     0.0623     0.4053

(b) POS Tagging Task
Model       POS Tagging (Acc)   Age (F1)   Gen (F1)
DPTEXT      0.9257              0.2218     0.3865
DPTEXTAGE   0.9218              0.2111     0.4179
DPTEXTGEN   0.9361              0.2412     0.3916

Sentiment Prediction Task. In the sentiment prediction task, it is observed that using solely one of the private attribute discriminators can result in a representation which performs better in terms of sentiment prediction (i.e., higher utility) in comparison to DPText, in which all three private attribute discriminators are used. This shows that obscuring all private attributes results in adding more noise and thus losing more of the quality of the resultant text representation. However, these variants perform poorly in terms of obscuring private attributes in comparison to the original DPText model. This shows that obscuring a specific private attribute can help with hiding information of other private attributes as well. This is because of the hidden relationship between different private attributes. In summary, these results indicate that although using one discriminator in the training process can help in preserving more semantic meaning, it can compromise the effectiveness of the learned representation in obscuring attributes.

POS Tagging Task. In the POS tagging task, results show that DPText achieves the best performance in the tagging task (i.e., higher utility) in comparison to other methods that solely use one of the private attribute discriminators. The reason is that the presence of age and gender related information in the text can negatively affect the tagging performance due to existing bias. DPText is thus more effective in removing information of all private attributes and hidden existing bias in comparison to DPTEXTAGE and DPTEXTGEN. Removing bias leads to more accurate tagging. Similar to the sentiment prediction task, it is observed that DPTEXTGEN, with only the gender attribute discriminator, is less effective than DPText in terms of hiding private attribute information. DPTEXTAGE, however, has the best results in terms of obscuring age attribute information while it is less effective in terms of hiding gender attribute information. This shows the hidden relationship between different private attributes.

Parameter Analysis. DPText has one important parameter α which controls the contribution from private attribute discriminator D_(P). The effect of this parameter is investigated by varying it over [0.125, 0.25, 0.5, 1, 2, 4, 8, 16]. ORIGINAL-{AGE/GEN/LOC} shows the results for the corresponding task when the original text representation has been utilized. Results are shown in FIGS. 7A and 7B, and FIGS. 7C and 7D for sentiment prediction and POS tagging, respectively.

Parameter α controls the contribution of the private attribute discriminator. However, it is surprisingly observed that in both the sentiment prediction and POS tagging tasks, as α increases, the F1 scores for prediction of the different private attributes first decrease up to the point α=1 and then increase. This means that the private attributes were obscured more effectively at the beginning with the increase of α and less effectively later. Moreover, with the increase of α, the accuracy of the sentiment prediction task decreases. This shows that increasing the contribution of the private attribute discriminator leads to a decrease in the utility of the resultant text representation. In the case of POS tagging, the accuracy first increases and then decreases after α=1. This shows that removing the age and gender related information results in removing the bias from the learned text representation and improves the tagging task. However, after α=1 the utility of the resultant representation decreases. These patterns are useful for selecting the value of parameter α in practice.

Moreover, in both tasks, setting α=0.125 results in an improvement in terms of the amount of hidden private information in comparison to the results of using the original representation. This observation supports the importance of the private attribute discriminator. Another observation is that, after α=1, continuously increasing α degrades the performance of hiding private attributes (i.e., increasing F1 scores) in both the sentiment prediction and POS tagging tasks. This is because the model could overfit as α increases, which leads to an inaccurate learned text representation in terms of preserving private attributes and the semantic meaning of the text.

FIG. 8 is a schematic block diagram of an example device 300 that may be used with one or more embodiments described herein, e.g., as a component of framework 100.

Device 300 comprises one or more network interfaces 310 (e.g., wired, wireless, PLC, etc.), at least one processor 320, and a memory 340 interconnected by a system bus 350, as well as a power supply 360 (e.g., battery, plug-in, etc.).

Network interface(s) 310 include the mechanical, electrical, and signaling circuitry for communicating data over the communication links coupled to a communication network. Network interfaces 310 are configured to transmit and/or receive data using a variety of different communication protocols. As illustrated, the box representing network interfaces 310 is shown for simplicity, and it is appreciated that such interfaces may represent different types of network connections such as wireless and wired (physical) connections. Network interfaces 310 are shown separately from power supply 360; however, it is appreciated that the interfaces that support PLC protocols may communicate through power supply 360 and/or may be an integral component coupled to power supply 360.

Memory 340 comprises a plurality of storage locations that are addressable by processor 320 and network interfaces 310 for storing software programs and data structures associated with the embodiments described herein. In some embodiments, device 300 may have limited memory or no memory (e.g., no memory for storage other than for programs/processes operating on the device and associated caches).

Processor 320 comprises hardware elements or logic adapted to execute the software programs (e.g., instructions) and manipulate data structures 345. An operating system 342, portions of which are typically resident in memory 340 and executed by the processor, functionally organizes device 300 by, inter alia, invoking operations in support of software processes and/or services executing on the device. These software processes and/or services may comprise DPText process/services 344, described herein. Note that while DPText process/services 344 is illustrated in centralized memory 340, alternative embodiments provide for the process to be operated within the network interfaces 310, such as a component of a MAC layer, and/or as part of a distributed computing network environment.

It will be apparent to those skilled in the art that other processor and memory types, including various computer-readable media, may be used to store and execute program instructions pertaining to the techniques described herein. Also, while the description illustrates various processes, it is expressly contemplated that various processes may be embodied as modules or engines configured to operate in accordance with the techniques herein (e.g., according to the functionality of a similar process). In this context, the terms module and engine may be interchangeable. In general, the term module or engine refers to a model or an organization of interrelated software components/functions. Further, while the DPText process 344 is shown as a standalone process, those skilled in the art will appreciate that this process may be executed as a routine or module within other processes.

It should be understood from the foregoing that, while particular embodiments have been illustrated and described, various modifications can be made thereto without departing from the spirit and scope of the invention as will be apparent to those skilled in the art. Such changes and modifications are within the scope and teachings of this invention as defined in the claims appended hereto.

What is claimed is:
 1. A method of generating a modified latent textrepresentation for a document, comprising: utilizing a processor incommunication with a tangible storage medium storing instructions thatare executed by the processor to perform operations comprising:generating an initial latent representation representative of text in adocument; inferring an amount of noise to be added to the initial latentrepresentation by finding a privacy budget value that minimizes a lossbetween a predicted semantic label and a ground truth semantic label forthe initial latent representation and maximizes a loss between apredicted private attribute label and a ground truth private attributelabel for the initial latent representation; and adding the amount ofnoise to the initial latent text representation to generate a modifiedlatent text representation.
 2. The method of claim 1, wherein the initial latent representation is generated using an auto-encoder trained to generate the initial latent representation from the text in the document.
 3. The method of claim 2, further comprising training the autoencoder by: generating an encoded latent representation representative of the text in the document by applying an encoder to the document; constructing a reconstructed document including reconstructed text representative of the text in the encoded latent representation by applying a decoder to the encoded latent representation; and identifying a plurality of autoencoder parameters that minimize a loss between the text in the document and the reconstructed text in the reconstructed document.
 4. The method of claim 1, further comprising: optimizing a set of semantic discriminator classifier parameters and a privacy budget value that minimize the loss between the predicted semantic label and the ground truth semantic label for the initial latent representation.
 5. The method of claim 4, further comprising: adding a first amount of noise to the initial latent representation to generate a modified latent representation.
 6. The method of claim 4, further comprising: generating the predicted semantic label by applying a classifier to the modified latent representation; minimizing a loss between the predicted semantic label and the ground truth semantic label; and identifying the set of semantic discriminator classifier parameters associated with the lowest loss value between the predicted semantic label and the ground truth semantic label.
 7. The method of claim 1, further comprising: selecting the privacy budget value that is associated with a lowest loss value between the predicted semantic label and the ground truth semantic label and that is associated with a highest loss value between the predicted private attribute label and a ground truth private attribute label.
 8. The method of claim 1, further comprising: optimizing a set of private attribute discriminator parameters and a privacy budget value that maximize the loss between the predicted private attribute label and the ground truth private attribute label for the initial latent representation.
 9. The method of claim 8, wherein the optimization of the set of private attribute discriminator parameters is modeled as a minmax game.
 10. The method of claim 8, further comprising: generating the predicted private attribute label by applying a classifier to the modified latent representation; maximizing a loss between the predicted private attribute label and the ground truth private attribute label; and selecting the set of private attribute discriminator classifier parameters associated with the lowest loss value between the predicted private attribute label and the ground truth private attribute label.
 11. The method of claim 8, further comprising: determining the amount of noise to add based on the privacy budget value by sampling a value r from a uniform distribution such that: $s(i) = -\frac{\Delta}{\epsilon} \times \operatorname{sgn}(r)\,\ln\left(1 - 2|r|\right), \quad i = 1, 2, \ldots, d$ wherein ϵ is a privacy budget, Δ is an L₁-sensitivity of the initial latent representation, d is a dimension of the initial latent representation, s is a noise vector and s(i) is the i-th element for noise vector s.
 12. The method of claim 1, wherein the step of finding the privacy budget value is iteratively repeated until convergence.
 13. The method of claim 12, wherein the process of adding the amount of noise to the initial latent text representation to generate a modified latent text representation runs concurrently with finding the privacy budget value.
 14. A computer system for generating a modified latent text representation for a document, comprising: at least one processor in communication with a memory and operable for execution of a plurality of modules, the plurality of modules including: an auto-encoder configured to generate an initial latent representation representative of text in a document; a noise adder module configured to receive the initial latent representation and add an amount of noise to the initial latent representation to generate a modified latent text representation based on a privacy budget value; a semantic meaning discriminator module configured to optimize a set of semantic discriminator classifier parameters and the privacy budget value such that a loss is minimized between the predicted semantic label and the ground truth semantic label for the initial latent representation; and a private attribute discriminator module configured to optimize a set of private attribute discriminator parameters and the privacy budget value such that a loss is maximized between the predicted private attribute label and the ground truth private attribute label for the initial latent representation.
 15. The computer system of claim 14, wherein the auto-encoder module is configured to: generate an encoded latent representation representative of the text in the document by applying an encoder to the document; construct a reconstructed document including reconstructed text representative of the text in the encoded latent representation by applying a decoder to the encoded latent representation; and identify a plurality of autoencoder parameters that minimize a loss between the text in the document and the reconstructed text in the reconstructed document.
 16. The computer system of claim 14, wherein the semantic meaning discriminator module is configured to: generate the predicted semantic label by applying a first classifier to the modified latent representation; minimize the loss between the predicted semantic label and the ground truth semantic label; and identify the set of semantic discriminator classifier parameters associated with the lowest loss value between the predicted semantic label and the ground truth semantic label.
 17. The computer system of claim 16, wherein the first classifier is implemented using a recurrent neural network that takes the set of semantic discriminator classifier parameters and the modified latent text representation as input.
 18. The computer system of claim 14, wherein the private attribute discriminator module is configured to: generate the predicted private attribute label by applying a second classifier to the modified latent representation; maximize a loss between the predicted private attribute label and the ground truth private attribute label; and select the set of private attribute discriminator classifier parameters associated with the lowest loss value between the predicted private attribute label and the ground truth private attribute label.
 19. The computer system of claim 18, wherein the second classifier is implemented using a recurrent neural network that takes the set of private attribute discriminator classifier parameters and the modified latent text representation as input.
 20. The computer system of claim 18, wherein the privacy budget value is associated with a lowest loss value between the predicted semantic label and the ground truth semantic label and a highest loss value between the predicted private attribute label and a ground truth private attribute label.